Question Type Guided Attention in Visual Question Answering
Visual Question Answering (VQA) requires integrating feature maps with
drastically different structures and focusing on the correct regions. Image
descriptors have structures at multiple spatial scales, while lexical inputs
inherently follow a temporal sequence and naturally cluster into semantically
different question types. Many previous works use complex models to extract
feature representations but neglect high-level summary information, such as
question type, during learning. In this work, we propose Question Type-guided
Attention (QTA). It uses question-type information to dynamically balance
between bottom-up and top-down visual features, extracted from ResNet and
Faster R-CNN networks respectively. We experiment with multiple VQA
architectures, with extensive input ablation studies over the TDIUC dataset,
and show that QTA systematically improves performance by more than 5% across
multiple question-type categories, such as "Activity Recognition", "Utility"
and "Counting", on TDIUC. By adding QTA to the state-of-the-art model MCB, we
achieve a 3% improvement in overall accuracy. Finally, we propose a multi-task
extension that predicts question types, which generalizes QTA to applications
that lack question-type labels, with minimal performance loss.
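The balancing mechanism described above can be sketched as a learned per-question-type gate over the two visual feature streams. This is a minimal illustrative sketch, not the paper's exact architecture: the shapes, the sigmoid gate, and the `qta_fuse` helper are all assumptions.

```python
import numpy as np

# Hypothetical sketch of Question Type-guided Attention (QTA): a learned
# per-question-type weight balances bottom-up (Faster R-CNN-style) and
# top-down (ResNet-style) visual features. Shapes and the gating form are
# illustrative assumptions, not the paper's exact design.

rng = np.random.default_rng(0)

N_TYPES = 12   # e.g. TDIUC question-type categories (illustrative)
DIM = 2048     # visual feature dimension (illustrative)

# One learned mixing logit per question type, squashed to (0, 1) below.
type_logits = rng.normal(size=N_TYPES)

def qta_fuse(bottom_up, top_down, q_type):
    """Blend the two visual feature vectors by question type."""
    alpha = 1.0 / (1.0 + np.exp(-type_logits[q_type]))  # sigmoid gate
    return alpha * bottom_up + (1.0 - alpha) * top_down

bu = rng.normal(size=DIM)   # stand-in for bottom-up features
td = rng.normal(size=DIM)   # stand-in for top-down features
fused = qta_fuse(bu, td, q_type=3)
print(fused.shape)
```

In a trained model the logits would be learned jointly with the rest of the network; here they are random placeholders.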
Differentially Private Bias-Term only Fine-tuning of Foundation Models
We study the problem of differentially private (DP) fine-tuning of large
pre-trained models -- a recent privacy-preserving approach suitable for solving
downstream tasks with sensitive data. Existing work has demonstrated that high
accuracy is possible under strong privacy constraints, yet requires significant
computational overhead or modifications to the network architecture.
We propose differentially private bias-term fine-tuning (DP-BiTFiT), which
matches the state-of-the-art accuracy for DP algorithms and the efficiency of
the standard BiTFiT. DP-BiTFiT is model agnostic (not modifying the network
architecture), parameter efficient (training only the bias terms, a small
fraction of the parameters), and computation efficient (almost removing the
overhead caused by
DP, in both the time and space complexity). On a wide range of tasks, DP-BiTFiT
is faster and uses less memory than DP full
fine-tuning, even faster than standard full fine-tuning. This
efficiency enables us to conduct DP fine-tuning on language and vision tasks
with long-sequence texts and high-resolution images, which were computationally
difficult using existing methods.
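The core idea, bias-only training combined with DP-SGD-style per-sample clipping and noising, can be sketched on a single linear layer. This is a toy sketch under stated assumptions: the clip norm, noise scale, learning rate, and squared-error loss are all illustrative, not the paper's settings.

```python
import numpy as np

# Minimal sketch of DP-BiTFiT on one linear layer: weights are frozen and
# only the bias is updated, with per-sample gradient clipping and Gaussian
# noise in the style of DP-SGD. All hyperparameters are illustrative.

rng = np.random.default_rng(1)

W = rng.normal(size=(4, 2))   # frozen "pretrained" weights
b = np.zeros(2)               # the only trainable parameters

X = rng.normal(size=(8, 4))   # toy batch of 8 examples
y = rng.normal(size=(8, 2))

CLIP, SIGMA, LR = 1.0, 0.5, 0.1

# Per-sample bias gradients for squared error: d/db ||xW + b - y||^2.
per_sample = 2.0 * (X @ W + b - y)                        # shape (8, 2)
norms = np.linalg.norm(per_sample, axis=1, keepdims=True)
clipped = per_sample * np.minimum(1.0, CLIP / norms)      # clip each sample
noisy_sum = clipped.sum(axis=0) + SIGMA * CLIP * rng.normal(size=2)
b = b - LR * noisy_sum / len(X)                           # bias-only update
print(b.shape)
```

Because only the bias gradients are ever materialized per sample, the memory and time overhead of DP training shrinks dramatically compared to clipping every parameter's per-sample gradient.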
Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer
Pretrained transformer models have demonstrated remarkable performance across
various natural language processing tasks. These models leverage the attention
mechanism to capture long- and short-range dependencies in the sequence.
However, the (full) attention mechanism incurs high computational cost -
quadratic in the sequence length, which is not affordable in tasks with long
sequences, e.g., inputs with 8k tokens. Although sparse attention can be used
to improve computational efficiency, as suggested in existing work, it has
limited modeling capacity and often fails to capture complicated dependencies
in long sequences. To tackle this challenge, we propose MASFormer, an
easy-to-implement transformer variant with Mixed Attention Spans. Specifically,
MASFormer is equipped with full attention to capture long-range dependencies,
but only at a small number of layers. For the remaining layers, MASFormer only
employs sparse attention to capture short-range dependencies. Our experiments
on natural language modeling and generation tasks show that a decoder-only
MASFormer model of 1.3B parameters can achieve competitive performance to
vanilla transformers with full attention while significantly reducing
computational cost (up to 75%). Additionally, we investigate the effectiveness
of continual training with long sequence data and how sequence length impacts
downstream generation performance, which may be of independent interest.
Comment: The 2023 Conference on Empirical Methods in Natural Language
Processing (EMNLP 2023 Findings)
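The layer-mixing idea can be illustrated with attention masks: a few layers get a full causal mask, while the rest use a causal sliding-window (sparse) mask. The window size and which layers keep full attention are illustrative assumptions, not MASFormer's actual configuration.

```python
import numpy as np

# Sketch of mixed attention spans: most layers attend only within a local
# window, a few layers attend over the full (causal) sequence. Sequence
# length, window size, and layer placement are illustrative.

SEQ, WINDOW = 16, 4

def local_mask(seq_len, window):
    """Causal sliding-window mask: token i attends to (i-window, i]."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

def full_mask(seq_len):
    """Standard causal mask: token i attends to all j <= i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return j <= i

N_LAYERS = 8
FULL_LAYERS = {0, 4}   # full attention only at a few layers (assumption)

masks = [full_mask(SEQ) if l in FULL_LAYERS else local_mask(SEQ, WINDOW)
         for l in range(N_LAYERS)]
print(masks[0].sum(), masks[1].sum())  # attended pairs: full vs sparse
```

The full-attention layers cost O(n^2) in sequence length, while the windowed layers cost O(n * window); restricting full attention to a handful of layers is what yields the reported compute savings.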
Better Context Makes Better Code Language Models: A Case Study on Function Call Argument Completion
Pretrained code language models have enabled great progress towards program
synthesis. However, common approaches only consider in-file local context and
thus miss information and constraints imposed by other parts of the codebase
and its external dependencies. Existing code completion benchmarks also lack
such context. To resolve these restrictions we curate a new dataset of
permissively licensed Python packages that includes full projects and their
dependencies and provide tools to extract non-local information with the help
of program analyzers. We then focus on the task of function call argument
completion which requires predicting the arguments to function calls. We show
that existing code completion models do not yield good results on our
completion task. To better solve this task, we query a program analyzer for
information relevant to a given function call, and consider ways to provide the
analyzer results to different code completion models during inference and
training. Our experiments show that providing access to the function
implementation and function usages greatly improves the argument completion
performance. Our ablation study provides further insights on how different
types of information available from the program analyzer and different ways of
incorporating the information affect the model performance.
Comment: 12 pages. Accepted to AAAI 202
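One simple way to provide analyzer results to a completion model is to prepend them to the prompt. The sketch below is a hedged illustration of that idea only: the `build_prompt` helper, the comment-based layout, and the `parse_config` example are hypothetical, not the paper's actual format or dataset.

```python
# Hypothetical sketch: supplying non-local context (function implementation
# and usage examples from a program analyzer) to a code completion model by
# prepending it to the completion prompt. The layout is an assumption.

def build_prompt(call_prefix, implementation=None, usages=()):
    """Prepend analyzer-provided context to the partial function call."""
    parts = []
    if implementation:
        parts.append("# Implementation:\n" + implementation)
    for u in usages:
        parts.append("# Usage example:\n" + u)
    parts.append(call_prefix)
    return "\n".join(parts)

# Hypothetical call site: the model must complete the argument list.
prompt = build_prompt(
    "result = parse_config(",
    implementation="def parse_config(path, strict=True): ...",
    usages=["cfg = parse_config('app.yaml', strict=False)"],
)
print(prompt)
```

The same context could instead be injected during training or via retrieval; the abstract notes that both inference-time and training-time provisioning were considered.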
On the accuracy and efficiency of group-wise clipping in differentially private optimization
Recent advances have substantially improved the accuracy, memory cost, and
training speed of differentially private (DP) deep learning, especially on
large vision and language models with millions to billions of parameters. In
this work, we thoroughly study the per-sample gradient clipping style, a key
component in DP optimization. We show that different clipping styles have the
same time complexity but instantiate an accuracy-memory trade-off: while the
all-layer clipping (of coarse granularity) is the most prevalent and usually
gives the best accuracy, it incurs a heavier memory cost than group-wise
clipping styles, such as layer-wise clipping (of finer granularity). We
formalize this trade-off through our convergence theory and complexity
analysis. Importantly, we demonstrate that the accuracy gap between group-wise
clipping and all-layer clipping becomes smaller for larger models, while the
memory advantage of the group-wise clipping remains. Consequently, the
group-wise clipping allows DP optimization of large models to achieve high
accuracy and low peak memory simultaneously.
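The contrast between the two clipping granularities can be sketched on one example's per-layer gradients: all-layer clipping rescales by the global norm, while layer-wise clipping rescales each layer independently. Shapes and the clip norm are illustrative assumptions.

```python
import numpy as np

# Sketch contrasting all-layer vs layer-wise per-sample gradient clipping.
# All-layer clipping bounds the norm of the full concatenated gradient;
# layer-wise clipping bounds each layer's gradient separately.

rng = np.random.default_rng(2)
CLIP = 1.0

# Per-sample gradients of one example for two "layers" (toy shapes).
g1 = rng.normal(size=5)
g2 = rng.normal(size=3)

# All-layer (coarse granularity): one factor from the global norm.
global_norm = np.sqrt(np.sum(g1**2) + np.sum(g2**2))
factor = min(1.0, CLIP / global_norm)
all_layer = [g1 * factor, g2 * factor]

# Layer-wise (finer granularity): each layer clipped to CLIP on its own.
layer_wise = [g * min(1.0, CLIP / np.linalg.norm(g)) for g in (g1, g2)]

print(np.linalg.norm(np.concatenate(all_layer)))
```

All-layer clipping must hold every layer's per-sample gradient in memory to compute the global norm, which is the memory cost the abstract contrasts with the group-wise styles.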